TableMetadataBuilder #587

c-thiel · 2024-08-26T18:47:48Z

This PR is now ready for first reviews.
Some Remarks:

In add_sort_order and add_partition_spec, the Java code rebuilds the added sort order or partition spec against the current schema by matching column names. This implementation currently does not do this. Adding this feature would require PartitionSpec (bound) to store the schema it was bound against (probably a good idea anyway) and splitting SortOrder into bound and unbound variants, where the bound SortOrder also stores the schema it was bound against. Instead, this implementation assumes that the provided sort orders and partition specs are valid for the current schema. Compatibility with the current schema is tested.
In contrast to Java, our add_schema method does not require a new_last_column_id argument. In Java there is a TODO to achieve the same. I put my reasoning in a comment in the code; feel free to comment on it.
I have a few specifics that would be good to discuss with a Java maintainer. It's not ready to be merged yet - please wait for my OK too :)
I had to change a few tests that used the builder under the hood, mostly due to the changed new() behaviour, which now re-assigns field ids to start from 0. Some tests started from 1 before. Re-assigning the ids, just like in Java, ensures that fresh metadata always has fresh and correct ids, even if they were created manually or re-used from other metadata.
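
To make the last remark concrete, here is a minimal sketch of what re-assigning field ids could look like. It is not the code in this PR: `Field` and `reassign_field_ids` are hypothetical names, the schema is flattened (real schemas are nested), and the starting id is a parameter because the thread mentions both 0 and 1 as starting points.

```rust
use std::collections::HashMap;

/// Hypothetical, flattened stand-in for a schema field (not the crate's own type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
}

/// Assign fresh, consecutive ids beginning at `start_from`, in field order.
/// Returns the renumbered fields plus an old-id -> new-id map, which a caller
/// would need to rewrite references such as partition or sort source ids.
fn reassign_field_ids(fields: &[Field], start_from: i32) -> (Vec<Field>, HashMap<i32, i32>) {
    let mut id_map = HashMap::new();
    let mut renumbered = Vec::with_capacity(fields.len());
    for (offset, field) in fields.iter().enumerate() {
        let new_id = start_from + offset as i32;
        id_map.insert(field.id, new_id);
        renumbered.push(Field {
            id: new_id,
            name: field.name.clone(),
        });
    }
    (renumbered, id_map)
}
```

Returning the id map is a design choice of this sketch: anything that refers to columns by id has to be translated with the same mapping, otherwise it would point at the wrong column after renumbering.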

c-thiel · 2024-08-30T11:31:51Z

ToDos:

First PR as discussion basis
Reserved Property handling
Finish discussion about SortOrder and PartitionSpec re-binding
Review from 1-2 Java Maintainers
Review from 1-2 Rust Maintainers

c-thiel · 2024-09-03T06:54:49Z

Fixes #232

liurenjie1024 · 2024-09-06T01:33:09Z

Thanks @c-thiel for this PR. I've skimmed through it and it looks great to me. However, this PR is too large to review (3k lines); would you mind splitting it into smaller ones? For example, we could add one PR for the methods involved in one TableUpdate action, with enough tests for it. Also, it would be better to move the refactoring of TableMetadataBuilder into a standalone module in a separate PR.

c-thiel · 2024-09-06T20:46:23Z

Thanks for your feedback @liurenjie1024. This isn't really a refactoring of the builder; it's more of a complete rewrite. The old builder allowed creating corrupt metadata in various ways. Splitting it up by TableUpdate would not be straightforward - many tests, also in other modules, depend on building metadata. Creating metadata now always goes through builder methods - it's a different architecture that requires basic support for all methods from the beginning just to keep tests running.
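
As a rough illustration of what "always goes through builder methods" means, the sketch below shows a checked builder method that refuses input which would corrupt the metadata. All types and names here (`Metadata`, `MetadataBuilder`, `with_schema`) are hypothetical simplifications, not the API in this PR.

```rust
use std::collections::HashSet;

/// Hypothetical, heavily simplified metadata: only what the sketch needs.
#[derive(Debug, Default)]
struct Metadata {
    schema_field_ids: HashSet<i32>,
    /// Each sort order is reduced to a list of source field ids.
    sort_orders: Vec<Vec<i32>>,
}

#[derive(Debug, Default)]
struct MetadataBuilder {
    metadata: Metadata,
}

impl MetadataBuilder {
    /// Set the field ids of the current schema.
    fn with_schema(mut self, field_ids: &[i32]) -> Self {
        self.metadata.schema_field_ids = field_ids.iter().copied().collect();
        self
    }

    /// Checked method: every source id must exist in the current schema.
    fn add_sort_order(mut self, source_ids: Vec<i32>) -> Result<Self, String> {
        for id in &source_ids {
            if !self.metadata.schema_field_ids.contains(id) {
                return Err(format!("sort field references unknown source id {}", id));
            }
        }
        self.metadata.sort_orders.push(source_ids);
        Ok(self)
    }

    /// Hand out the metadata; nothing can be added without passing the checks above.
    fn build(self) -> Metadata {
        self.metadata
    }
}
```

Because the only way to obtain a `Metadata` in this sketch is via `build()`, tests that previously assembled metadata by hand would have to use the builder as well, which is the dependency described above.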

I would currently prefer to keep it as a larger block mainly because:

I don't have much time currently and it's going to be more effort
We would need to write auxiliary code to provide non-checked methods so that crate tests don't fail
The total timespan of merging 10 or so PRs is expected to be much longer than a Rust maintainer and a Java maintainer putting in ~2 full days of effort to review it as a block.
Patterns are repetitive and can be reviewed together in many cases
A lot of it is tests - the core builder is 1145 lines. It's long - but doable :)

We now have a vision of what it could look like in the end. Before putting any more effort in, we should answer the following questions:

Is the overall structure OK or should we head in a different direction?
My first point of the opening statement: Do we re-write our SortOrder and add the schema to PartitionSpec so that we can match on names like Java does or not?
My second point from the opening statement: How do we handle new_last_column_id

Those points might change the overall design quite a bit and might require a re-write of SortOrder first (split to bound and unbound).

After we answered those questions, and we still think splitting makes sense, I can try to find time to build stacked-PRs. Maybe just splitting normalization / validation in table_metadata.rs from the actual builder would be a leaner option than splitting every single TableUpdate?

c-thiel · 2024-09-08T09:40:20Z

@liurenjie1024 I tried to cut a few things out - but not along the lines of TableUpdate. I hope that's OK?

After they are all merged, I'll rebase this PR for the actual builder.

liurenjie1024 · 2024-09-09T03:46:22Z

Hi @c-thiel, sorry for the late reply.

Is the overall structure OK or should we head in a different direction?

I went through the new builder and I think your design is the right direction.

My first point of the opening statement: Do we re-write our SortOrder and add the schema to PartitionSpec so that we can match on names like Java does or not?

To be honest, I don't quite understand the use case. We can ask for background of this in dev channel, but I think this is not a blocker of this pr, we can always add this later.

My second point from the opening statement: How do we handle new_last_column_id

I took a look at the comments on these two PRs: apache/iceberg#6701 apache/iceberg#7445

I think the reason for this behavior is that last_column_id is optional, and we calculate it from the highest field id when it is missing. Allowing the user to pass last_column_id should be added to be compatible with the current behavior, but this is a feature that could be added later.

liurenjie1024 · 2024-09-09T03:47:46Z

Those points might change the overall design quite a bit and might require a re-write of SortOrder first (split to bound and unbound).

I agree that this should be required, as I mentioned in #550

liurenjie1024 · 2024-09-09T04:00:17Z

After we answered those questions, and we still think splitting makes sense, I can try to find time to build stacked-PRs. Maybe just splitting normalization / validation in table_metadata.rs from the actual builder would be a leaner option than splitting every single TableUpdate?

That sounds reasonable to me. If one PR per table update is too much of a burden, could we split them by component, for example sort order, partition spec, and schema changes?

c-thiel · 2024-09-09T10:43:01Z

@liurenjie1024 thanks for the Feedback!

My first point of the opening statement: Do we re-write our SortOrder and add the schema to PartitionSpec so that we can match on names like Java does or not?

To be honest, I don't quite understand the use case. We can ask for background of this in dev channel, but I think this is not a blocker of this pr, we can always add this later.

The problem with changing it later is that it changes the semantics of the function. Right now we expect source_id to match the current_schema() (which is also the reason why I expose it during build). Java doesn't do this; instead, it looks up the name by id in the schema bound to a SortOrder or PartitionSpec, and then searches for the same name in the new schema to use it.

In my opinion, ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to Java semantics will require new endpoints (e.g. add_migrate_partition_spec or so) that take a bound partition spec, in contrast to the unbound spec we currently pass in.

Give me a thumbs up if that's OK for you. I'll also open a discussion in the dev channel to get some more opinions.
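
The difference between the two lookup semantics discussed in this comment can be sketched as follows. This is only an illustration under simplifying assumptions: `Schema` is a hypothetical flat list of (field id, column name) pairs, and `rebind_by_name` and `resolve_by_id` are made-up helper names, not functions from either code base.

```rust
/// Hypothetical flattened schema: (field id, column name) pairs.
type Schema = Vec<(i32, String)>;

/// Find the column name for a field id.
fn name_of(schema: &Schema, id: i32) -> Option<&str> {
    schema
        .iter()
        .find(|(fid, _)| *fid == id)
        .map(|(_, name)| name.as_str())
}

/// Find the field id for a column name.
fn id_of(schema: &Schema, name: &str) -> Option<i32> {
    schema
        .iter()
        .find(|(_, n)| n.as_str() == name)
        .map(|(fid, _)| *fid)
}

/// Name-based re-binding as described for Java in this thread: resolve the
/// name in the schema the spec was bound to, then look that name up in the
/// current schema.
fn rebind_by_name(bound: &Schema, current: &Schema, source_id: i32) -> Option<i32> {
    let name = name_of(bound, source_id)?;
    id_of(current, name)
}

/// Id-based semantics: the source id must exist in the current schema as-is.
fn resolve_by_id(current: &Schema, source_id: i32) -> Option<i32> {
    if current.iter().any(|(fid, _)| *fid == source_id) {
        Some(source_id)
    } else {
        None
    }
}
```

If a column `ts` with id 2 is dropped and re-added with id 3, `rebind_by_name` silently maps the old source id 2 to the new column 3, while `resolve_by_id` rejects id 2. That is the dropped-and-re-added case mentioned above.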

c-thiel · 2024-09-09T10:54:08Z

My second point from the opening statement: How do we handle new_last_column_id

I took a look at the comments on these two PRs: apache/iceberg#6701 apache/iceberg#7445

I think the reason for this behavior is that last_column_id is optional, and we calculate it from the highest field id when it is missing. Allowing the user to pass last_column_id should be added to be compatible with the current behavior, but this is a feature that could be added later.

I don't think we should add the argument, to be honest. My reasoning is as follows:
If specified, it could be a way to artificially increase last_assigned_field_id of a TableMetadata. I can't see any motivation behind that. It just adds a source of confusion about what to specify here - and what to do if it's wrong.
The only useful thing to do with it is to check for outdated TableMetadata at the time of constructing the Schema.
I added this check here:
https://github.com/apache/iceberg-rust/pull/587/files#diff-04f26c83b3c614be6f6d6cfb6c4cefef9e01ec2d31395ac487cdcdff2dbae729R442-R451

Maybe @nastra or @Fokko could add some comments on the intention of that parameter?
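
For reference, deriving the value without a caller-supplied argument could look roughly like the sketch below. It is an assumption about the general approach and not the code linked above; `next_last_column_id` is a hypothetical helper name and the schema is reduced to a list of field ids.

```rust
/// Derive the table's new last column id when a schema is added, without a
/// caller-supplied `new_last_column_id`. The value never decreases: it is the
/// larger of the current value and the highest field id in the added schema.
fn next_last_column_id(current_last_column_id: i32, schema_field_ids: &[i32]) -> i32 {
    // An empty schema contributes nothing, so fall back to the current value.
    let highest_in_schema = schema_field_ids
        .iter()
        .copied()
        .max()
        .unwrap_or(current_last_column_id);
    current_last_column_id.max(highest_in_schema)
}
```

With this shape there is nothing for the caller to get wrong, which is the point made against adding the argument.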

Xuanwo · 2024-09-09T12:46:34Z

I have reviewed most PRs that I am confident can be merged. The only one left is #615, for which I need more input.

c-thiel marked this pull request as draft August 26, 2024 18:48

c-thiel force-pushed the ft/table-metadata-builder branch from 6afeeb0 to 4193131 Compare August 30, 2024 10:48

c-thiel marked this pull request as ready for review August 30, 2024 11:33

ZENOTME mentioned this pull request Sep 1, 2024

Add apply interface in transaction #596

Open

c-thiel changed the title ~~WIP: TableMetadataBuilder~~ TableMetadataBuilder Sep 3, 2024

This was referenced Sep 8, 2024

Feat: Normalize TableMetadata #611

Merged

feat: partition compatibility #612

Merged

feat: Reassign field ids for schema #615

Merged

c-thiel mentioned this pull request Sep 23, 2024

feat: Safer PartitionSpec & SchemalessPartitionSpec #645

Merged

c-thiel added 2 commits October 1, 2024 12:55

Merge branch 'feat/schema-reassign-field-ids'

1f49c95

Merge branch 'feat/safe-partition-spec'

fea1817

c-thiel force-pushed the ft/table-metadata-builder branch from a3c1c89 to fea1817 Compare October 1, 2024 19:14

c-thiel added 7 commits October 2, 2024 07:51

New builder

2c9d7e4

Fix test: Field ids start from 0

fd79c33

fix expire_metadata_log

9955676

Fix import

a0e0021

Allow no previous location

edba06c

export TableMetadataBuildResult

0178a1e

Field IDs mast start from 1 for Java compat

15f18ed

c-thiel added 2 commits October 2, 2024 17:54

Fix last_column_id

3543dfd

Add test_expire_metadata_log

e00450b
